Some imports:
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
pd.options.display.max_rows = 10
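The "group by" concept: we want to apply the same function to subsets of a dataframe, where the subsets are defined by some key.
This operation is also referred to as the "split-apply-combine" operation, involving three steps: splitting the data into groups based on some key, applying a function to each group independently, and combining the results into a single data structure.
It is similar to SQL GROUP BY.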
The example of the image in pandas syntax:
In [ ]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
Using the filtering and reduction operations we have seen in the previous notebooks, we could do something like:
df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...
But pandas provides the groupby method to do this:
In [ ]:
df.groupby('key').aggregate('sum') # np.sum
In [ ]:
df.groupby('key').sum()
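To make the split-apply-combine steps concrete, here is a minimal sketch that performs them by hand on the same toy dataframe and compares the result with groupby. The variable names `manual` and `grouped` are only illustrative, and the loop is meant as an explanation, not as a recommended way to compute group sums.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

manual = {}
for key in df['key'].unique():
    # Split: select the rows belonging to this key.
    subset = df[df['key'] == key]
    # Apply: reduce the 'data' column of that subset.
    manual[key] = subset['data'].sum()

# Combine: collect the per-group results into one Series.
manual = pd.Series(manual, name='data')
manual.index.name = 'key'

# The same computation in a single groupby call.
grouped = df.groupby('key')['data'].sum()
```

Both `manual` and `grouped` hold one value per key; groupby simply does the splitting and combining for you.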
Pandas does not only let you group by a column name. In df.groupby(grouper), the grouper can be many things, for example a function that is applied to each index label:
In [ ]:
df.groupby(lambda x: x % 2).mean(numeric_only=True)  # the function is called on each index label; the string column 'key' is skipped
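As a rough illustration of the different kinds of grouper, the sketch below groups the same toy dataframe by a column name, by a boolean Series, by a function and by a dict. The names `is_large` and `mapping`, and the group labels 'first' and 'rest', are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# 1. A column name: rows are grouped by the values in that column.
by_name = df.groupby('key')['data'].sum()

# 2. A Series with the same index as the dataframe: rows are grouped
#    by the values of that Series (here a boolean condition).
is_large = (df['data'] >= 10).rename('is_large')
by_series = df.groupby(is_large)['data'].count()

# 3. A function: it is called on each index label, and the return
#    value is used as the group key (here even vs. odd row labels).
by_func = df.groupby(lambda label: label % 2)['data'].sum()

# 4. A dict: it maps each index label to a group key.
mapping = {label: ('first' if label < 3 else 'rest') for label in df.index}
by_dict = df.groupby(mapping)['data'].sum()
```

In every case the result has one entry per distinct group key, whatever object produced those keys.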
These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv, and put them in the /data folder.
cast dataset: different roles played by actors/actresses in films
In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()
In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()
In [ ]:
# %load snippets/04b - Advanced groupby operations8.py
In [ ]:
# %load snippets/04b - Advanced groupby operations9.py
In [ ]:
# %load snippets/04b - Advanced groupby operations10.py
In [ ]:
# %load snippets/04b - Advanced groupby operations11.py
In [ ]:
# %load snippets/04b - Advanced groupby operations12.py
In [ ]:
# %load snippets/04b - Advanced groupby operations13.py
In [ ]:
# %load snippets/04b - Advanced groupby operations15.py
Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with transform:
In [ ]:
df
In [ ]:
df.groupby('key').transform('mean')
In [ ]:
def normalize(group):
    return (group - group.mean()) / group.std()
In [ ]:
df.groupby('key').transform(normalize)
In [ ]:
df.groupby('key').transform('sum')
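The difference between aggregating and transforming is easiest to see in the shape of the result. The sketch below, on the same toy dataframe, shows that an aggregation returns one row per group while transform returns one row per original row, which makes it convenient for computing each value's share of its group total. The column names `group_total` and `share` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Aggregation: the result is indexed by the group keys (3 rows).
agg = df.groupby('key')['data'].sum()

# Transformation: the group result is broadcast back to every row,
# so the result keeps the original index (9 rows).
group_total = df.groupby('key')['data'].transform('sum')

# Because the indexes line up, the transformed values can be combined
# directly with the original column.
with_share = df.assign(group_total=group_total,
                       share=df['data'] / group_total)
```

Within each key, the `share` values add up to 1.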
In [ ]:
# %load snippets/04b - Advanced groupby operations21.py
Tip: you can do a groupby twice, in two steps: first calculating the numbers, and then the ratios.
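As an illustration of this two-step pattern on made-up data (not the exercise data), the sketch below first counts rows per combination of two keys and then divides each count by the total of its outer group. The dataframe `roles` and its columns `decade` and `type` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per role.
roles = pd.DataFrame({
    'decade': [1990, 1990, 1990, 2000, 2000, 2000, 2000],
    'type': ['actor', 'actor', 'actress',
             'actor', 'actress', 'actress', 'actress'],
})

# Step 1: the numbers, i.e. the count of rows per (decade, type).
counts = roles.groupby(['decade', 'type']).size()

# Step 2: the ratios, i.e. each count divided by the total of its
# decade, obtained with a second groupby on the first index level.
totals = counts.groupby(level='decade').transform('sum')
ratios = counts / totals
```

The ratios within each decade sum to 1, which is a quick sanity check for this kind of calculation.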
In [ ]:
# %load snippets/04b - Advanced groupby operations22.py
In [ ]:
# %load snippets/04b - Advanced groupby operations23.py
In [ ]:
# %load snippets/04b - Advanced groupby operations24.py
Python strings have a lot of useful methods available to manipulate or check the content of the string:
In [ ]:
s = 'Bratwurst'
In [ ]:
s.startswith('B')
In pandas, those methods (together with some additional methods) are also available for string Series through the .str accessor:
In [ ]:
s = pd.Series(['Bratwurst', 'Kartoffelsalat', 'Sauerkraut'])
In [ ]:
s.str.startswith('B')
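A few more .str methods, sketched on the same three strings. Methods that return booleans, such as `contains`, can be used directly as a mask to filter the Series or a dataframe; the variable names below are only illustrative.

```python
import pandas as pd

s = pd.Series(['Bratwurst', 'Kartoffelsalat', 'Sauerkraut'])

# Element-wise substring test: returns a boolean Series.
mask = s.str.contains('kraut')

# The boolean Series can be used for filtering.
selected = s[mask]

# Other methods return a transformed value for each element.
lengths = s.str.len()
upper = s.str.upper()
```

The same boolean-mask idea applies to a dataframe column, for example to keep only the rows whose string column starts with a given prefix.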
For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
In [ ]:
# %load snippets/04b - Advanced groupby operations29.py
In [ ]:
# %load snippets/04b - Advanced groupby operations30.py
In [ ]:
# %load snippets/04b - Advanced groupby operations31.py
In [ ]:
# %load snippets/04b - Advanced groupby operations32.py
A useful shortcut to calculate the number of occurrences of certain values is value_counts (this is somewhat equivalent to df.groupby(key).size()).
For example, what are the most frequently occurring movie titles?
In [ ]:
titles.title.value_counts().head()
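To see why value_counts is only "somewhat" equivalent to a groupby followed by size, the sketch below compares the two on a small made-up dataframe (`toy_titles` is a hypothetical stand-in, not the real titles data). The counts are the same, but value_counts sorts by descending count while groupby sorts by the group key.

```python
import pandas as pd

# Hypothetical stand-in for the titles dataframe.
toy_titles = pd.DataFrame({
    'title': ['Hamlet', 'Macbeth', 'Hamlet', 'Othello', 'Hamlet', 'Macbeth'],
})

# Counts per distinct value, most frequent first.
counts = toy_titles['title'].value_counts()

# The same counts via groupby, ordered by title instead of by count.
sizes = toy_titles.groupby('title').size()

# Sorting the groupby result by count gives the same ranking.
sizes_sorted = sizes.sort_values(ascending=False)
```

Here `counts` and `sizes_sorted` both list the most frequent title first.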
In [ ]:
# %load snippets/04b - Advanced groupby operations34.py
In [ ]:
# %load snippets/04b - Advanced groupby operations35.py
In [ ]:
# %load snippets/04b - Advanced groupby operations36.py
In [ ]:
# %load snippets/04b - Advanced groupby operations37.py
In [ ]:
# %load snippets/04b - Advanced groupby operations38.py
In [ ]:
# %load snippets/04b - Advanced groupby operations39.py
In [ ]:
# %load snippets/04b - Advanced groupby operations40.py
In [ ]:
# %load snippets/04b - Advanced groupby operations41.py
In [ ]:
# %load snippets/04b - Advanced groupby operations42.py
In [ ]: